Open Information Extraction Using Constraints over Part-of-speech Sequences

Authors

  • Alisa Zhila
  • Alexander Gelbukh
Abstract

In the 2010s, several exabytes of data are produced daily. Between one fifth and one third of these data are text. To make use of such huge amounts of textual data, we need to detect, extract, structure, and process the important information conveyed in this data flow in a fast and scalable manner. Open information extraction (Open IE) is a solution for the detection, extraction, and initial structuring of information. Open IE is an open-domain, relation-independent paradigm for information extraction performed in an unsupervised manner. High speed and scalability are therefore its main advantages, which make Open IE a very promising field for research and for application to other text processing tasks. In this work we conducted extensive research on various methods for Open IE and on its application to other tasks, and we contributed in several ways. First, we introduced an Open IE method that requires minimal pre-processing of the input, which ensures its speed and robustness. Additionally, we proposed this method for the Spanish language and showed it to be superior to other methods in at least one of two aspects: precision or robustness. As an additional contribution, we introduced a method for comparing the performance of Open IE systems implemented for different languages by comparing their outputs on parallel datasets. Next, we introduced a method for Open IE with additional semantic pre-processing that allows semantic interpretation of the extracted relations, which was not possible with other Open IE methods. We showed that this method has very high precision, although at the cost of yield. Most importantly, we demonstrated the application of this method to the semantic structuring of extractions by introducing a novel procedure for presenting extractions in RDF/XML, a standard format maintained by the W3C.
Further, we showed that Open IE can serve to measure the informativeness of Web documents, which is one aspect of document quality. Not only did we show that it is applicable to documents from arbitrary domains extracted from the Web “as is”, without additional pre-processing, but we also showed that Open IE can serve a complex text processing task that has a direct impact on the end user. Last but not least, we made publicly available the software and evaluation resources developed as part of this work.
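
The abstract does not spell out the constraints themselves, so the following is only a minimal sketch of the general idea named in the title: relation triples are found by matching patterns over a sentence's part-of-speech sequence. The single-letter tag set, the two patterns, and the function name `extract_triples` are all assumptions made for illustration (the verb-plus-preposition pattern is in the style of earlier POS-based Open IE systems); they are not the authors' actual rules.

```python
import re

# Assumed single-letter coarse tags (hypothetical, for illustration only):
# D determiner, A adjective, N noun, V verb, P preposition, R adverb.
# A real tagger's tag set would first be mapped onto these letters.
REL_PHRASE = re.compile(r"V+(?:[DANR]*P)?")  # verb(s), optionally ending in a preposition
NP_AT_END = re.compile(r"D?A*N+$")           # noun phrase ending right before the relation
NP_AT_START = re.compile(r"D?A*N+")          # noun phrase starting right after the relation


def extract_triples(tagged):
    """Return (arg1, relation, arg2) triples from a list of (word, tag) pairs.

    Each tag is one character, so an index into the tag string is also
    an index into the word list.
    """
    words = [w for w, _ in tagged]
    tags = "".join(t for _, t in tagged)
    triples = []
    for rel in REL_PHRASE.finditer(tags):
        left = NP_AT_END.search(tags[:rel.start()])
        right = NP_AT_START.match(tags[rel.end():])
        if left is None or right is None:
            continue  # a relation phrase needs a noun phrase on both sides
        arg1 = " ".join(words[left.start():left.end()])
        relation = " ".join(words[rel.start():rel.end()])
        arg2 = " ".join(words[rel.end():rel.end() + right.end()])
        triples.append((arg1, relation, arg2))
    return triples


sentence = [("The", "D"), ("cat", "N"), ("sat", "V"),
            ("on", "P"), ("the", "D"), ("mat", "N")]
print(extract_triples(sentence))
```

Because the sketch needs only POS tags and regular-expression matching, it reflects why an approach of this kind can stay fast and robust with minimal pre-processing; a practical system would add many more constraints than these two patterns.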

Similar resources

Open Information Extraction from Real Internet Texts in Spanish Using Constraints over Part-Of-Speech Sequences: Problems of the Method, Their Causes, and Ways for Improvement

Usually we do not know the domain of an arbitrary text from the Internet, or the semantics of the relations it conveys. While humans identify such information easily, for a computer this task is far from straightforward. The task of detecting relations of arbitrary semantic type in texts is known as Open Information Extraction (Open IE). The approach to this task based on heuristic constraints ...

A New Method for Improving Computational Cost of Open Information Extraction Systems Using Log-Linear Model

Information extraction (IE) is a process of automatically providing a structured representation from an unstructured or semi-structured text. It is a long-standing challenge in natural language processing (NLP) which has been intensified by the increased volume of information and heterogeneity, and non-structured form of it. One of the core information extraction tasks is relation extraction wh...

Prototype-Driven Learning for Sequence Models

We investigate prototype-driven learning for primarily unsupervised sequence modeling. Prior knowledge is specified declaratively, by providing a few canonical examples of each target annotation label. This sparse prototype information is then propagated across a corpus using distributional similarity features in a log-linear generative model. On part-of-speech induction in English and Chinese,...

Maximum Entropy Markov Models for Information Extraction and Segmentation

Hidden Markov models (HMMs) are a powerful probabilistic tool for modeling sequential data, and have been applied with success to many text-related tasks, such as part-of-speech tagging, text segmentation and information extraction. In these cases, the observations are usually modeled as multinomial distributions over a discrete vocabulary, and the HMM parameters are set to maximize the likelih...

Effectiveness and Efficiency of Open Relation Extraction

A large number of Open Relation Extraction approaches have been proposed recently, covering a wide range of NLP machinery, from “shallow” (e.g., part-of-speech tagging) to “deep” (e.g., semantic role labeling–SRL). A natural question then is what is the tradeoff between NLP depth (and associated computational cost) versus effectiveness. This paper presents a fair and objective experimental comp...

Publication date: 2014